\[~\]
\[~\]
\[~\]
\[~\]
\[~\]
\[~\]
\[~\]
\[~\]
Nowadays, thanks to improvements in technology, there is an amount of data that allows us to understand whether or not there is a problem with a person's health.
In this project we explain the data and try to predict whether a patient should be diagnosed with heart disease or not.
\[~\]
\[~\]
\[~\]
Kaggle is the platform where I found this interesting dataset, to which we apply Bayesian inference as the main scope of this project.
This dataset is composed of the following features:
##
## -- Column specification --------------------------------------------------------
## cols(
## age = col_double(),
## sex = col_double(),
## cp = col_double(),
## trtbps = col_double(),
## chol = col_double(),
## fbs = col_double(),
## restecg = col_double(),
## thalachh = col_double(),
## exng = col_double(),
## oldpeak = col_double(),
## slp = col_double(),
## caa = col_double(),
## thall = col_double(),
## output = col_double()
## )
## # A tibble: 6 x 14
## age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 63 1 3 145 233 1 0 150 0 2.3 0
## 2 37 1 2 130 250 0 1 187 0 3.5 0
## 3 41 0 1 130 204 0 0 172 0 1.4 2
## 4 56 1 1 120 236 0 1 178 0 0.8 2
## 5 57 0 0 120 354 0 1 163 1 0.6 2
## 6 57 1 0 140 192 0 1 148 0 0.4 1
## # ... with 3 more variables: caa <dbl>, thall <dbl>, output <dbl>
\[~\]
Analyzing these features, we captured some interesting information about each feature in the dataset:
\[~\]
As we can see above, we have both qualitative and quantitative data, which will be explained in detail in the next subsections.
The main features have the following summaries:
\[~\]
## age sex cp trtbps
## Min. :29.00 Min. :0.0000 Min. :0.0000 Min. : 94.0
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:120.0
## Median :55.50 Median :1.0000 Median :1.0000 Median :130.0
## Mean :54.42 Mean :0.6821 Mean :0.9636 Mean :131.6
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:140.0
## Max. :77.00 Max. :1.0000 Max. :3.0000 Max. :200.0
## chol fbs restecg thalachh
## Min. :126.0 Min. :0.000 Min. :0.0000 Min. : 71.0
## 1st Qu.:211.0 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:133.2
## Median :240.5 Median :0.000 Median :1.0000 Median :152.5
## Mean :246.5 Mean :0.149 Mean :0.5265 Mean :149.6
## 3rd Qu.:274.8 3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.:166.0
## Max. :564.0 Max. :1.000 Max. :2.0000 Max. :202.0
## exng oldpeak slp caa
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.800 Median :1.000 Median :0.0000
## Mean :0.3278 Mean :1.043 Mean :1.397 Mean :0.7185
## 3rd Qu.:1.0000 3rd Qu.:1.600 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :6.200 Max. :2.000 Max. :4.0000
## thall output
## Min. :0.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:0.000
## Median :2.000 Median :1.000
## Mean :2.315 Mean :0.543
## 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :3.000 Max. :1.000
\[~\] As we can see above:
…The behaviour of the other features in the dataset can be seen in this summary, so let's move on to data visualization to get a clear picture of the data we are treating!
\[~\]
\[~\]
In this subsection we describe the dataset graphically, to get a first taste of which model may suit it best; we show some interesting plots below.
\[~\]
\[~\]
PCA is, among other things, a visualization tool that allows us to show which patients are similar to each other. Essentially, principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, often using only the first few principal components and ignoring the rest. PCA is used in exploratory data analysis and for building predictive models. It is commonly used for dimensionality reduction: each data point is projected onto only the first few principal components, yielding lower-dimensional data while preserving as much of the data's variation as possible.
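The idea can be sketched in a few lines of base R (a toy illustration, not the project's actual code): `prcomp()` computes the components and `summary()` reports how much variance the first two preserve.

```r
# Toy PCA sketch: project correlated data onto its first two components
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)      # one strongly correlated column

pca     <- prcomp(X, center = TRUE, scale. = TRUE)
scores  <- pca$x[, 1:2]                       # each "patient" on the PC1-PC2 plane
cum.var <- summary(pca)$importance["Cumulative Proportion", 2]  # variance kept by 2 PCs
```

Plotting `scores` coloured by the response is exactly the kind of two-dimensional map shown in this section.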
\[~\]
\[~\]
It is normal to expect some loss in this representation, given the number of features we consider, but here we actually recover 89.8% of the variance in the entire dataset using two principal components, so the information is well preserved.
\[~\]
\[~\]
Here, we analyze the percentages of each categorical variable:
\[~\]
The most frequent category is typical angina (47.4%), defined as substernal chest pain precipitated by physical exertion or emotional stress and relieved by rest or nitroglycerin (angina reflects a decreased flow of blood and oxygen to the heart; nitroglycerin relaxes the vessels so that blood can flow more easily through the patient's body).
Women and elderly patients usually have atypical symptoms both at rest and during stress, often in the setting of nonobstructive coronary artery disease (CAD).
\[~\]
The ST/heart rate slope (ST/HR slope) has been proposed as a more accurate ECG criterion for diagnosing significant coronary artery disease (CAD).
The most frequent slope categories are the first and second types (flat and downsloping); these are qualitative labels describing the shape of the heart rate response.
\[~\]
The most common case is 0 major vessels involved, meaning that no vessels are damaged. In a smaller number of cases, a few vessels (around 1-2) are damaged.
\[~\]
For the thalassemia feature (thall), the most frequent category is the fixed defect.
\[~\]
Angina may feel like pressure in the chest, jaw or arm, and it frequently occurs with exercise or stress. Some people with angina also report feeling lightheaded, overly tired, short of breath or nauseated.
As the heart pumps harder to keep up with activity, it needs more oxygen-rich blood. In our data, however, most patients do not show exercise-induced angina.
\[~\]
\[~\]
\[~\]
Here, after having seen the categorical data in the previous section, we illustrate the main features of the quantitative data:
\[~\]
hist.and.summary('age', 'Persons Age')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 48.00 55.50 54.42 61.00 77.00
hist.and.summary('chol', 'Cholestoral')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 126.0 211.0 240.5 246.5 274.8 564.0
hist.and.summary('thalachh', 'Maximum Heart Rate Achieved')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 71.0 133.2 152.5 149.6 166.0 202.0
hist.and.summary('trtbps', 'Resting Blood Pressure')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 94.0 120.0 130.0 131.6 140.0 200.0
hist.and.summary('oldpeak', 'Previous Peak')
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.800 1.043 1.600 6.200
\[~\]
As we can see above, the distributions of these variables differ, although some resemble each other; for example, if you compare the maximum heart rate with the age of the patients, the distributions are clearly different! These considerations could be useful for predicting heart attacks.
After this, we also consider the densities, distinguishing the cases with and without a heart attack:
\[~\]
dense.chart('age', 'Persons Age')
dense.chart('chol', 'Cholestoral')
dense.chart('thalachh', 'Maximum Heart Rate Achieved')
dense.chart('trtbps', 'Resting Blood Pressure')
dense.chart('oldpeak', 'Previous Peak')
\[~\]
Here, we want to highlight the correlations between the features, treated as qualitative, as quantitative, and without any distinction.
\[~\]
The correlations are measured considering the Pearson formula:
\[ \rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \cdot \sigma_Y} \] Where:
These correlation maps show some interesting correlations. It is important to note that the quantitative variables are more meaningful here than the qualitative ones, since correlation is designed for quantitative measures, but we can still highlight and inspect the qualitative variables. So, in the complete map we note that the:
it seems that these variables will be important in our predictions; we will see in a moment whether this is the case!
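The Pearson coefficients behind these maps can be checked with base R's `cor()`, which implements the formula above; a minimal sketch on two toy features (not the project's actual variables):

```r
# Pearson correlation between two toy features, computed both with cor()
# and directly from the definition rho = Cov(X,Y) / (sd(X) * sd(Y))
set.seed(42)
x <- rnorm(50)
y <- 0.8 * x + rnorm(50, sd = 0.5)

rho.builtin <- cor(x, y, method = "pearson")
rho.manual  <- cov(x, y) / (sd(x) * sd(y))
```

On the full dataset, `cor(heart.data)` would produce the whole matrix shown in the maps at once.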
\[~\]
\[~\]
Likelihood \(\pi(y_{obs}|\theta)\): measures the goodness of fit of a statistical model to a sample of data for given values of the unknown parameters. It is formed from the joint probability distribution of the sample, but viewed and used as a function of the parameters only, thus treating the random variables as fixed at their observed values.
Logistic regression model: logistic regression is the appropriate regression analysis when the dependent variable is dichotomous (binary). It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables. It is also faster to fit than many other binary classifiers.
Logit link function: uses the cumulative distribution function (CDF) of the logistic distribution. A benefit of the logit is that the coefficients can be interpreted in terms of odds ratios.
Cloglog link function: unlike the logit and probit, it is asymmetric. It is best used when the probability of an event is very small or very large. The complementary log-log approaches 0 more slowly than the other link functions. The cloglog model is closely related to continuous-time models for the occurrence of events, so it has important applications in survival analysis and hazard modeling.
Prior \(\pi(\theta)\): the probability distribution that expresses one's beliefs about the parameter before any evidence is taken into account.
Deviance: a goodness-of-fit statistic for a statistical model, often used for statistical hypothesis testing. It generalizes the idea of using the residual sum of squares (RSS) in ordinary least squares to cases where model fitting is achieved by maximum likelihood.
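As a small illustration of the two link functions above (an assumed sketch, not code from the analysis), both inverse links map a linear predictor \(\eta\) to a probability in (0, 1), but only the cloglog is asymmetric:

```r
# Inverse link functions: logit via the logistic CDF, cloglog from its formula
eta <- seq(-3, 3, by = 0.5)

p.logit   <- plogis(eta)            # 1 / (1 + exp(-eta))
p.cloglog <- 1 - exp(-exp(eta))     # inverse complementary log-log

# The logit is symmetric, p(-eta) = 1 - p(eta); the cloglog is not
p.cloglog.neg <- 1 - exp(-exp(-eta))
sym.logit     <- max(abs(plogis(-eta) - (1 - plogis(eta))))
asym.cloglog  <- max(abs(p.cloglog.neg - (1 - p.cloglog)))
```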
\[~\]
\[~\]
The main goal is to carry out a fully Bayesian analysis aimed at understanding whether a person is likely to have a heart attack or not.
To test the model in the prediction phase, we split the dataset into a training set (80%) and a test set (20%), so that we can measure how well our model performs. I also decided to scale the values, so that the predictors are represented on a comparable scale and are easier for the model to handle.
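A sketch of the split-and-scale step (with stand-in data and assumed column names, since the exact preprocessing code is not echoed here):

```r
# 80/20 train/test split plus scaling of the numeric predictors
set.seed(123)
df <- data.frame(age    = rnorm(302, 54, 9),
                 chol   = rnorm(302, 246, 50),
                 output = rbinom(302, 1, 0.54))   # stand-in for the real data

train.idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
dat       <- df[train.idx, ]                      # 241 training rows
dat_test  <- df[-train.idx, ]                     # 61 test rows

num.cols <- c("age", "chol")
dat[num.cols] <- lapply(dat[num.cols], function(x) c(scale(x)))  # mean 0, sd 1
```

Note that floor(0.8 * 302) = 241, which matches the 241 observations seen by glm() and JAGS below.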
So, the main ingredients of the first model are:
The response variable \(Y_i\) (the “output” feature), a binary value indicating whether there is a heart attack or not, so \(Y_i \in \{0,1\}\)
The predictor variables (\(x_i \in \mathbb{R}\)), chosen using the glm() function, which fits generalized linear models (we are cheating a bit, since this way we know in advance which are the right variables, but at the same time it lets us try the frequentist approach)
As we can see below, if we consider all the variables within the logistic regression model, we obtain:
\[~\]
##
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 +
## x10 + x11 + x12 + x13, family = binomial(link = "logit"),
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5551 -0.4208 0.1495 0.6217 2.3968
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.52933 0.46303 3.303 0.000957 ***
## x1 0.10353 0.23347 0.443 0.657460
## x2 -1.75631 0.52136 -3.369 0.000755 ***
## x3 0.94949 0.21706 4.374 1.22e-05 ***
## x4 -0.33423 0.19701 -1.697 0.089785 .
## x5 -0.29725 0.21885 -1.358 0.174382
## x6 -0.01219 0.57688 -0.021 0.983141
## x7 0.26634 0.19699 1.352 0.176368
## x8 0.38210 0.25303 1.510 0.131015
## x9 -0.73391 0.46097 -1.592 0.111363
## x10 -0.58441 0.26382 -2.215 0.026750 *
## x11 0.43430 0.23503 1.848 0.064623 .
## x12 -0.80081 0.22693 -3.529 0.000417 ***
## x13 -0.46325 0.19466 -2.380 0.017321 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.16 on 240 degrees of freedom
## Residual deviance: 176.60 on 227 degrees of freedom
## AIC: 204.6
##
## Number of Fisher Scoring iterations: 6
\[~\]
As we can see above, the glm() fit suggests discarding these variables, whose coefficients are not significant:
… the intercept, on the other hand, is a good choice to keep in our model.
An important remark: this is the frequentist approach to these data, and we should remember that considering all of the features we achieved an AIC value of 204.6.
But what happens if we try with only the features that we decided to keep for the next steps?
summary(glm(y ~ x2+x3+x4+x10+x11+x12+x13, family = binomial(link = "logit"),data=dat)) # frequentistic approach
##
## Call:
## glm(formula = y ~ x2 + x3 + x4 + x10 + x11 + x12 + x13, family = binomial(link = "logit"),
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6033 -0.5454 0.1627 0.6219 2.5898
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2029 0.3984 3.020 0.002531 **
## x2 -1.5464 0.4557 -3.394 0.000689 ***
## x3 1.1094 0.2031 5.462 4.7e-08 ***
## x4 -0.3561 0.1874 -1.900 0.057444 .
## x10 -0.6548 0.2496 -2.623 0.008718 **
## x11 0.6082 0.2231 2.727 0.006397 **
## x12 -0.7987 0.2113 -3.780 0.000157 ***
## x13 -0.5084 0.1820 -2.794 0.005209 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 333.16 on 240 degrees of freedom
## Residual deviance: 187.04 on 233 degrees of freedom
## AIC: 203.04
##
## Number of Fisher Scoring iterations: 5
The behaviour changes slightly: we obtain an AIC value of 203.04. We will see whether the Bayesian analysis reaches the same conclusion!
We kept these features after manually checking which ones work best, i.e. we performed a simple form of feature selection.
We then decided to consider other models as well, to see at the end which one is the best model for our analysis.
In summary, we have N = 302 well-distributed observations in our dataset.
\[~\]
\[~\]
In the first model, we consider the following logistic regression model with the logit as link function:
\[ Y_i \sim Bern(p_i) \\ logit(p_i) = log\Big(\frac{p_i}{1-p_i}\Big)= \beta_{0} + \beta_{2} \cdot x_{2_i} + \beta_{3} \cdot x_{3_i} + \beta_{4} \cdot x_{4_i}+ \beta_{10} \cdot x_{10_i} + \beta_{11} \cdot x_{11_i} + \beta_{12} \cdot x_{12_i} + \beta_{13} \cdot x_{13_i} \] The beta prior parameters are chosen with mean \(\mu = 0\) and precision \(\tau^2 = 0.000001\). So, they are distributed according to a normal distribution as follows:
\[ \beta_i \sim N(\mu = 0, \tau^2 = 0.000001) \,\,\, for \,\, i \,\, = 0, 2,3,4,10,11,12,13 \]
The main reason for this choice is that we are facing a binary classification problem, and logistic regression lets us model outcomes that take only the values 0 or 1.
\[~\]
\[~\]
As we can see, we implemented the model using RJags:
\[~\]
\[~\]
model <- function(){
# Likelihood
for (i in 1:N){
y[i] ~ dbern(p[i])
logit(p[i]) <- beta0 + beta2*x2[i] + beta3*x3[i] + beta4*x4[i] + beta10*x10[i] + beta11*x11[i] + beta12*x12[i] + beta13*x13[i] # intercept plus the selected predictors only
}
# Defining the prior beta parameters
beta0 ~ dnorm(0, 1.0E-6)
beta2 ~ dnorm(0, 1.0E-6)
beta3 ~ dnorm(0, 1.0E-6)
beta4 ~ dnorm(0, 1.0E-6)
beta10 ~ dnorm(0, 1.0E-6)
beta11 ~ dnorm(0, 1.0E-6)
beta12 ~ dnorm(0, 1.0E-6)
beta13 ~ dnorm(0, 1.0E-6)
}
Remark: in RJags the second parameter of dnorm() is not the standard deviation but the precision, \(\tau^2 = \frac{1}{\sigma^{2}}\).
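Concretely, dnorm(0, 1.0E-6) in the model above therefore corresponds to a very diffuse normal prior:

```r
# Converting a JAGS precision into the usual variance / standard deviation
tau      <- 1.0e-6          # second argument of dnorm() in the JAGS model
variance <- 1 / tau         # 1e6
sd.prior <- 1 / sqrt(tau)   # 1000: an essentially flat Normal(0, sd = 1000) prior
```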
\[~\]
# Passing the data for RJags
data.jags <- list("y" = y, "N" = N,
"x2" = x2, "x3" = x3, "x4" = x4, "x10" = x10, "x11" = x11, "x12" = x12, "x13" = x13)
# Defining parameters of interest to show after running RJags
mod.params <- c("beta0", "beta2", "beta3", "beta4", "beta10", "beta11", "beta12", "beta13")
\[~\]
# Run JAGS
set.seed(123)
n.chains <- 3
mod.fit <- jags(data = data.jags, # DATA
model.file = model, # MODEL
parameters.to.save = mod.params, # TRACKING
n.chains = n.chains, n.iter = 10000, n.burnin = 1000, n.thin = 10) # MCMC
## module glm loaded
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 241
## Unobserved stochastic nodes: 8
## Total graph size: 2498
##
## Initializing model
mod.fit
## Inference for Bugs model at "C:/Users/Jeremy/AppData/Local/Temp/RtmpsXRjp9/model1da4264e1b93.txt", fit using jags,
## 3 chains, each with 10000 iterations (first 1000 discarded), n.thin = 10
## n.sims = 2700 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
## beta0 1.253 0.404 0.499 0.971 1.237 1.527 2.090 1.004 620
## beta10 -0.690 0.257 -1.197 -0.859 -0.688 -0.518 -0.174 1.001 2700
## beta11 0.645 0.228 0.193 0.492 0.646 0.795 1.091 1.001 2700
## beta12 -0.851 0.217 -1.291 -0.994 -0.846 -0.705 -0.431 1.001 2700
## beta13 -0.531 0.191 -0.915 -0.662 -0.527 -0.402 -0.155 1.002 1900
## beta2 -1.621 0.463 -2.531 -1.921 -1.613 -1.311 -0.749 1.003 960
## beta3 1.173 0.211 0.781 1.031 1.164 1.310 1.607 1.001 2700
## beta4 -0.381 0.194 -0.765 -0.510 -0.377 -0.244 -0.016 1.001 2700
## deviance 195.262 4.190 189.226 192.285 194.602 197.414 205.675 1.001 2100
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 8.8 and DIC = 204.0
## DIC is an estimate of expected predictive error (lower deviance is better).
\[~\]
This last run of the MCMC gives us some interesting results:
mu.vect and sd.vect: the posterior mean and standard deviation of each parameter, estimated from the MCMC simulation.
the percentages (quantiles): the posterior quantiles, delimiting the interval in which the parameter is likely to lie
Rhat: values near 1 suggest convergence
n.eff: the effective number of independent samples, i.e. how efficiently the chains explore the stationary region.
\[~\]
\[~\]
Here, we show the univariate trace-plots of the simulations of each parameter:
\[~\]
\[~\]
We also consider other interesting plots, highlighting each chain separately:
\[~\]
Now, we want to see the autocorrelations of each parameter:
## beta0 beta10 beta11 beta12 beta13
## Lag 0 1.0000000000 1.0000000000 1.00000000 1.0000000000 1.00000000
## Lag 10 0.0211107471 0.0029593247 0.01554344 0.0352270136 0.01384619
## Lag 50 -0.0004896374 -0.0100244593 -0.02281873 0.0095543446 0.03338720
## Lag 100 -0.0023794538 -0.0003204078 -0.03571999 -0.0047104210 0.02737051
## Lag 500 -0.0195576910 -0.0079951474 -0.03789510 -0.0005732452 0.01197219
## beta2 beta3 beta4 deviance
## Lag 0 1.00000000 1.000000000 1.000000000 1.000000000
## Lag 10 0.01203182 0.023922742 -0.044994798 -0.004000511
## Lag 50 -0.01877322 -0.008598919 -0.005844695 -0.002389403
## Lag 100 0.01472360 0.035265045 -0.023988021 0.005257172
## Lag 500 -0.03228712 -0.004412367 0.004901904 0.014990556
\[~\]
The ACFs show good behaviour: we need the samples at different iterations to be approximately independent of each other. Here, we see that the correlations decay after the first lag, decreasing quickly towards 0. So, this is a good sign!
We can confirm our intuition with the values returned by autocorr.diag(), which returns the vector of average autocorrelations across all chains.
\[~\]
\[~\] In this section we want to look at the empirical (running) average \(\mathbf{\hat{I}}_{t}\) as \(t \, = \, 1,...,T\) increases. Before starting, we write down the formula for \(\mathbf{\hat{I}}_{t}\), which is:
\[~\]
\[ \mathbf{I} \approx \mathbf{\hat{I}}_{t} = \frac{1}{t} \sum_{i=1}^{t} h(\theta^{(i)}) \]
\[~\]
It is important to write \(\approx\) because this approximation leverages the strong law of large numbers (SLLN), and we also want to confirm this assumption empirically! So, let's see the results:
\[~\]
As we can see above, in all three chains each parameter reaches the same end point: even when starting from different initial points, the three chains converge to the same estimated posterior mean.
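A toy sketch of the computation behind these plots (assumed, since the plotting code is not echoed): the running mean \(\hat{I}_t\) is just cumsum(x) / t for each chain, and chains with different starting randomness settle at the same value.

```r
# Running (ergodic) means of three toy "chains" targeting the same mean
set.seed(7)
chains    <- replicate(3, rnorm(5000, mean = 2), simplify = FALSE)
run.means <- lapply(chains, function(x) cumsum(x) / seq_along(x))

end.points <- sapply(run.means, tail, n = 1)   # where each curve settles
```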
\[~\]
\[~\]
Here, we want to quantify the approximation error of the MCMC sampling. Essentially, we consider the square root of the variance below, i.e. the Monte Carlo standard error (MCSE).
The variance formula in MCMC sampling is:
\[ \mathbf{V}[\hat{I}_{t}] = \frac{Var_{\pi}[h(X_{1})]}{t_{eff}} = \Big( 1 + 2 \sum_{k=1}^{\infty} \rho_{k}\Big)\frac{\sigma^{2}}{T} \]
where \(t_{eff}\) is the effective sample size (ESS). The idea is to have a sort of “exchange rate” between dependent and independent samples.
For example, 1,000 samples from one Markov chain may be worth only about 70 independent samples, because the MCMC samples are highly correlated.
Conversely, 1,000 samples from a different Markov chain may be worth about 400 independent samples, because although those MCMC samples are dependent, they are only weakly correlated.
…More formal details on the slides.
As we can see, model 1 has the following effective sample sizes for the parameters involved in this analysis:
## beta0 beta10 beta11 beta12 beta13 beta2 beta3 beta4
## 620 2700 2700 2700 1900 960 2700 2700
## deviance
## 2100
This means that, with respect to the 10000 iterations, our samples are only weakly correlated.
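The mechanics of the ESS can be sketched with a crude truncated-ACF estimate (coda::effectiveSize, which produces the values above, does this more carefully with a spectral estimate; this toy AR(1) chain is only an assumed illustration):

```r
# Crude effective sample size: n / (1 + 2 * sum(rho_k)) for an AR(1) chain
set.seed(11)
n <- 10000
x <- as.numeric(arima.sim(model = list(ar = 0.9), n = n))

rho <- acf(x, lag.max = 100, plot = FALSE)$acf[-1]   # sample ACF, drop lag 0
ess <- n / (1 + 2 * sum(rho))                        # far fewer than n samples
```

With autocorrelation 0.9 per lag, the 10000 draws are worth only a few hundred independent samples.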
…Now let's move on to calculating the MCSE.
In this case the MCSE is the square root of the variance formula written above, so the results (for the pooled chains) are:
n <- length(colnames(mod.fit$BUGSoutput$sims.matrix))
mcse_dataframe <- data.frame(MCSE_jointly = rep(NA, n))
rownames(mcse_dataframe) <- colnames(mod.fit$BUGSoutput$sims.matrix)[1:n]
for(colname in colnames(mod.fit$BUGSoutput$sims.matrix)[1:n]){
mcse_dataframe[colname, "MCSE_jointly"] <- LaplacesDemon::MCSE(mod.fit$BUGSoutput$sims.matrix[ , colname])
}
mcse_dataframe
## MCSE_jointly
## beta0 0.007700067
## beta10 0.005366610
## beta11 0.004552970
## beta12 0.004224457
## beta13 0.003738720
## beta2 0.008647477
## beta3 0.003904511
## beta4 0.003802566
## deviance 0.081550028
\[~\]
As we can see, \(\beta_2\) has the highest approximation error when the chains are pooled together.
\[~\]
\[~\]
The uncertainty is measured using each parameter's variability, i.e. its posterior standard deviation relative to its absolute posterior expectation:
## mean sd variability
## beta0 1.2534228 0.4036600 0.32204613
## beta10 -0.6899771 0.2569432 0.37239384
## beta11 0.6453638 0.2279989 0.35328734
## beta12 -0.8509458 0.2166542 0.25460396
## beta13 -0.5306761 0.1914158 0.36070167
## beta2 -1.6207944 0.4628305 0.28555781
## beta3 1.1726727 0.2108305 0.17978629
## beta4 -0.3811791 0.1937635 0.50832672
## deviance 195.2616418 4.1904971 0.02146093
The highest posterior uncertainty is associated with \(\beta_4\).
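The "variability" column above is simply the coefficient of variation, sd / |mean|; recomputing two rows from the table as a check:

```r
# Posterior variability as sd / |mean|, using the summary values shown above
post <- data.frame(mean = c(1.2534228, -0.3811791),
                   sd   = c(0.4036600, 0.1937635),
                   row.names = c("beta0", "beta4"))
post$variability <- post$sd / abs(post$mean)
```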
\[~\]
\[~\]
Here, we report the correlations between all the parameter draws produced by the MCMC sampling, using the pooled matrix of chains:
the highest correlation is between \(\beta_2 \text{ and } \beta_0\).
\[~\]
\[~\]
Here, we want to check whether these multiple simulations of the Markov chains reach convergence and a valid stationary region. We consider several diagnostic tools, such as:
\[~\]
\[~\] This is an MCMC diagnostic (due to Raftery and Lewis) that estimates the number of iterations needed for a given level of precision in the posterior quantiles.
coda::raftery.diag(coda.fit)
## [[1]]
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## You need a sample size of at least 3746 with these values of q, r and s
##
## [[2]]
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## You need a sample size of at least 3746 with these values of q, r and s
##
## [[3]]
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.005
## Probability (s) = 0.95
##
## You need a sample size of at least 3746 with these values of q, r and s
\[~\]
As we can see above, each chain would need a sample size of at least 3746 to achieve the requested accuracy with the values of q, r and s shown above.
\[~\]
\[~\] This diagnostic is based on a test for equality of the means of the first and last part of a Markov chain (by default the first 10% and the last 50%). If the samples are drawn from a stationary distribution of the chain, then the two means are equal and Geweke’s statistic has an asymptotically standard normal distribution.
The test statistic is a standard Z-score: the difference between the two sample means divided by its estimated standard error. The standard error is estimated from the spectral density at zero, and so takes into account any autocorrelation.
The Z-score is calculated under the assumption that the two parts of the chain are asymptotically independent.
coda::geweke.diag(coda.fit)
## [[1]]
##
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5
##
## beta0 beta10 beta11 beta12 beta13 beta2 beta3 beta4
## 0.09202 -0.15363 -0.15142 -0.12666 0.96896 -0.55992 -1.81629 -0.08696
## deviance
## 1.04008
##
##
## [[2]]
##
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5
##
## beta0 beta10 beta11 beta12 beta13 beta2 beta3 beta4
## 0.84470 0.50344 0.69952 -0.68834 -0.31083 -1.09793 -0.05248 -0.59881
## deviance
## -1.68349
##
##
## [[3]]
##
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5
##
## beta0 beta10 beta11 beta12 beta13 beta2 beta3 beta4
## -0.95422 -0.22789 -0.45308 0.54138 1.32315 1.08537 -1.95547 -2.37299
## deviance
## -0.07728
\[~\]
\[~\]
The 'potential scale reduction factor' is calculated for each variable in x, together with upper and lower confidence limits. Approximate convergence is diagnosed when the upper limit is close to 1. For multivariate chains, a multivariate value is calculated that bounds above the potential scale reduction factor for any linear combination of the (possibly transformed) variables.
The confidence limits are based on the assumption that the stationary distribution of the variable under examination is normal. Hence the 'transform' parameter may be used to improve the normal approximation.
coda::gelman.diag(coda.fit)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## beta0 1.01 1.02
## beta10 1.00 1.00
## beta11 1.00 1.00
## beta12 1.00 1.00
## beta13 1.00 1.02
## beta2 1.00 1.01
## beta3 1.00 1.00
## beta4 1.00 1.00
## deviance 1.00 1.01
##
## Multivariate psrf
##
## 1.01
\[~\]
\[~\] This test implements a convergence diagnostic, and removes up to half the chain in order to ensure that the means are estimated from a chain that has converged.
The convergence test uses the Cramer-von-Mises statistic to test the null hypothesis that the sampled values come from a stationary distribution.
The test is successively applied, firstly to the whole chain, then after discarding the first 10%, 20%, … of the chain until either the null hypothesis is accepted, or 50% of the chain has been discarded.
The latter outcome constitutes 'failure' of the stationarity test and indicates that a longer MCMC run is needed. If the stationarity test is passed, the number of iterations to keep and the number to discard are reported.
The half-width test calculates a 95% confidence interval for the mean, using the portion of the chain which passed the stationarity test.
Half the width of this interval is compared with the estimate of the mean. If the ratio between the half-width and the mean is lower than eps, the halfwidth test is passed. Otherwise the length of the sample is deemed not long enough to estimate the mean with sufficient accuracy.
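The output below presumably comes from coda::heidel.diag(coda.fit), though the call itself is not echoed in the chunk; a minimal sketch of the same diagnostic on a toy chain that is stationary by construction:

```r
# Heidelberger-Welch diagnostic on a toy stationary chain
library(coda)
set.seed(5)
toy <- mcmc(rnorm(2000, mean = 1))
hd  <- heidel.diag(toy, eps = 0.1, pvalue = 0.05)  # stationarity + halfwidth tests
```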
\[~\]
## [[1]]
##
## Stationarity start p-value
## test iteration
## beta0 passed 1 0.784
## beta10 passed 1 0.632
## beta11 passed 1 0.384
## beta12 passed 1 0.867
## beta13 passed 1 0.133
## beta2 passed 1 0.443
## beta3 passed 1 0.140
## beta4 passed 1 0.544
## deviance passed 1 0.263
##
## Halfwidth Mean Halfwidth
## test
## beta0 passed 1.238 0.0264
## beta10 passed -0.690 0.0167
## beta11 passed 0.647 0.0164
## beta12 passed -0.855 0.0144
## beta13 passed -0.539 0.0164
## beta2 passed -1.606 0.0302
## beta3 passed 1.174 0.0143
## beta4 passed -0.386 0.0113
## deviance passed 195.420 0.2838
##
## [[2]]
##
## Stationarity start p-value
## test iteration
## beta0 passed 1 0.851
## beta10 passed 1 0.766
## beta11 passed 1 0.690
## beta12 passed 1 0.779
## beta13 passed 1 0.864
## beta2 passed 1 0.898
## beta3 passed 1 0.856
## beta4 passed 1 0.505
## deviance passed 1 0.331
##
## Halfwidth Mean Halfwidth
## test
## beta0 passed 1.237 0.0283
## beta10 passed -0.696 0.0166
## beta11 passed 0.641 0.0149
## beta12 passed -0.850 0.0151
## beta13 passed -0.526 0.0127
## beta2 passed -1.606 0.0307
## beta3 passed 1.172 0.0135
## beta4 passed -0.382 0.0128
## deviance passed 195.258 0.2648
##
## [[3]]
##
## Stationarity start p-value
## test iteration
## beta0 passed 1 0.417
## beta10 passed 1 0.772
## beta11 passed 1 0.939
## beta12 passed 1 0.245
## beta13 passed 1 0.451
## beta2 passed 1 0.693
## beta3 passed 91 0.103
## beta4 passed 1 0.360
## deviance passed 1 0.435
##
## Halfwidth Mean Halfwidth
## test
## beta0 passed 1.286 0.0256
## beta10 passed -0.684 0.0145
## beta11 passed 0.649 0.0151
## beta12 passed -0.848 0.0137
## beta13 passed -0.526 0.0123
## beta2 passed -1.651 0.0297
## beta3 passed 1.175 0.0142
## beta4 passed -0.375 0.0125
## deviance passed 195.106 0.2724
\[~\]
We give the most weight to the Heidelberger-Welch test, because it double-checks both stationarity and convergence (via the halfwidth test); when something goes wrong it is usually this test that fails while the other tools still look fine, so I prefer it over the others just seen.
Here all the convergence and stationarity tests are passed, so our intuitions are correct!
\[~\]
\[~\]
After checking that the simulations pass the convergence diagnostics, we pool all the chains together and compute credible intervals and point estimates for our estimated beta values.
These values will also be useful in the next subsections; they are:
chainMat <- mod.fit$BUGSoutput$sims.matrix
# Point estimates
(beta.hat.jags <- colMeans(chainMat))
## beta0 beta10 beta11 beta12 beta13 beta2
## 1.2534228 -0.6899771 0.6453638 -0.8509458 -0.5306761 -1.6207944
## beta3 beta4 deviance
## 1.1726727 -0.3811791 195.2616418
# Credible Intervals
cred <- 0.95
(p.ET.jags <- apply(chainMat, 2, quantile,
prob=c((1-cred)/2, 1-(1-cred)/2)))
## beta0 beta10 beta11 beta12 beta13 beta2 beta3
## 2.5% 0.498899 -1.1972836 0.1927263 -1.2908820 -0.9145778 -2.530835 0.7805914
## 97.5% 2.090480 -0.1739558 1.0912389 -0.4309298 -0.1549915 -0.749096 1.6065815
## beta4 deviance
## 2.5% -0.76490900 189.2259
## 97.5% -0.01604922 205.6747
# What about the HPD?
(p.HPD.jags <- coda::HPDinterval(as.mcmc(chainMat)))
## lower upper
## beta0 0.4457538 2.02851382
## beta10 -1.1906330 -0.17137021
## beta11 0.2228514 1.11505456
## beta12 -1.2526543 -0.40877941
## beta13 -0.9092385 -0.15203867
## beta2 -2.5017979 -0.72419932
## beta3 0.7726020 1.59192938
## beta4 -0.7656510 -0.01544929
## deviance 188.6945028 203.85876831
## attr(,"Probability")
## [1] 0.95
\[~\]
It is relevant to conclude that all the parameters (and their variables) are significant for predicting heart attacks: none of the credible intervals contains zero, even though \(\beta_{4}\) carries the highest posterior uncertainty among the parameters.
\[~\]
\[~\]
Here, we want to illustrate how to use the Markov chain to approximate the posterior predictive distribution of whether a person will have a heart attack or not, given the features explained above. So, we want to approximate this quantity:
\[ m(y_{new}) = \int_{\Theta} f(y_{new}|\theta, Y_{old}, x_2, x_3, x_4, x_{10}, x_{11}, x_{12}, x_{13})\pi(\theta)d\theta \]
where \(\theta\) collects the coefficients considered above, \(\Theta\) is the parameter space and \(\pi(\theta)\) is the product of the priors, thanks to the prior independence assumption.
In the end, we want to sample from \(\pi(\theta)\) and from \(f(y_{new}|\theta, Y_{old}, x_2, x_3, x_4, x_{10}, x_{11}, x_{12}, x_{13})\).
Here is a brief practical example where we predict the observations in the test set:
# preparing the test set for prediction
dat_test[c(2,3,4,5,6,7)] <- lapply(dat_test[c(2,3,4,5,6,7)], function(x) c(scale(x))) # scaling the numeric predictors, as on the training set
# Saving the estimated beta parameters
beta_est <- list("beta0" = mod.fit$BUGSoutput$summary["beta0", "mean"],
"beta2" = mod.fit$BUGSoutput$summary["beta2", "mean"],
"beta3" = mod.fit$BUGSoutput$summary["beta3", "mean"],
"beta4" = mod.fit$BUGSoutput$summary["beta4", "mean"],
"beta10" = mod.fit$BUGSoutput$summary["beta10", "mean"],
"beta11" = mod.fit$BUGSoutput$summary["beta11", "mean"],
"beta12" = mod.fit$BUGSoutput$summary["beta12", "mean"],
"beta13" = mod.fit$BUGSoutput$summary["beta13", "mean"])
# Iterating over the rows and computing each predicted value
dat_test$output_predictive <- apply(dat_test, 1, function(x){
o <- beta_est$beta0 + beta_est$beta2*x["sex"] + beta_est$beta3*x["cp"] + beta_est$beta4*x["trtbps"] + beta_est$beta10*x["oldpeak"] + beta_est$beta11*x["slp"] + beta_est$beta12*x["caa"] + beta_est$beta13*x["thall"]
o_perc <- 1/(1+exp(-o))
y_pred <- rbinom(n = 10000, size = 1, prob = o_perc)
# get the most repeated value
predicted <- unique(y_pred)
predicted <- predicted[which.max(tabulate(match(y_pred, predicted)))]
return(predicted)
})
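Note that the loop above plugs in only the posterior means of the betas, which throws away posterior uncertainty. A fully Bayesian alternative (a sketch; `predict_ppd` is a hypothetical helper reusing the `chainMat` draws from earlier) averages over the whole chain instead:

```r
# Posterior predictive classification: one Bernoulli draw per posterior
# sample of beta, then a majority vote over all the draws.
predict_ppd <- function(x, chain) {
  eta <- chain[, "beta0"] +
    chain[, "beta2"]  * x["sex"]    + chain[, "beta3"]  * x["cp"] +
    chain[, "beta4"]  * x["trtbps"] + chain[, "beta10"] * x["oldpeak"] +
    chain[, "beta11"] * x["slp"]    + chain[, "beta12"] * x["caa"] +
    chain[, "beta13"] * x["thall"]
  p <- 1 / (1 + exp(-eta))                        # per-draw success probability
  as.integer(mean(rbinom(length(p), 1, p)) > 0.5) # majority over the chain
}
dat_test$output_ppd <- apply(dat_test, 1, predict_ppd, chain = chainMat)
```

This propagates the whole posterior of the coefficients into each prediction, instead of collapsing it to a point estimate first.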
# Confusion matrix
conf_mtx <- as.data.frame(round(prop.table(table(dat_test$output, dat_test$output_predictive)), digits = 3))
hchart(conf_mtx, type = "heatmap", hcaes(x = Var1, y = Var2, value = Freq)) %>%
hc_title(text = "The Confusion Matrix") %>%
hc_plotOptions(
series = list(
borderColor = "#fcfbfa",
borderWidth = 1,
animation = list(duration = 1000),
dataLabels = list(enabled = TRUE)
)) %>%
hc_xAxis(title=list(text="Actual Values")) %>%
hc_yAxis(title=list(text="Predicted Values") )
# Showing the accuracy and the error together
info_dt <- data.frame("Accuracy" = mean(dat_test$output == dat_test$output_predictive), "Error" = mean(dat_test$output != dat_test$output_predictive))
highchart() %>%
hc_chart(type = "column") %>%
hc_title(text = "The Classifier") %>%
hc_plotOptions(column = list(stacking = "normal")) %>%
hc_add_series(name="Accuracy of the Model",
data = info_dt$Accuracy,
stack = "Accuracy") %>%
hc_add_series(name="Error of the Model",
data = info_dt$Error,
stack = "Error") %>%
hc_chart(options3d=list(enabled=TRUE, alpha=2, beta=-10,
depth=100, viewDistance=25)) %>%
hc_plotOptions(column=list(depth= 100))
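Beyond the overall accuracy, the same confusion table also yields sensitivity and specificity. A minimal sketch, assuming `output` and `output_predictive` as built above:

```r
# Sensitivity and specificity complement the overall accuracy.
tab <- table(actual = dat_test$output, predicted = dat_test$output_predictive)
accuracy    <- sum(diag(tab)) / sum(tab)        # overall hit rate
sensitivity <- tab["1", "1"] / sum(tab["1", ])  # true positive rate
specificity <- tab["0", "0"] / sum(tab["0", ])  # true negative rate
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```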
As we can see above, the model predicts the test-set values quite well!
Now, let us introduce a new model, to see whether it is worth replacing the current one.
\[~\]
\[~\] Now, we want to compare the first model with a second one. We still consider a regression for the binary outcome, but we insert a polynomial term, \(cloglog(p_i) = \beta_{4} \cdot x_{4_i}^2 + (\beta_{5} \cdot \beta_{6}) \cdot (x_{5_i} + x_{6_i}) + \beta_{7} \cdot x_{7_i} + \beta_{8} \cdot x_{8_i} + \beta_{13} \cdot x_{{13}_i}\), with features chosen somewhat arbitrarily, and we change the link function to the cloglog.
Why this change? Because we want to see which model is better, and below I will show the way to decide which model is better for our data.
What is a link function? A link function transforms the probabilities of the levels of a categorical response variable onto a continuous, unbounded scale. Once the transformation is done, the relationship between the predictors and the response can be modeled with, for example, a logistic regression.
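The two inverse links used in this project can be compared directly: the logistic is symmetric around 0.5, while the cloglog is not. A small illustration:

```r
# Inverse link functions: both map the linear predictor eta into (0, 1).
eta <- seq(-3, 3, by = 1)
inv_logit   <- 1 / (1 + exp(-eta))   # logistic: symmetric around eta = 0
inv_cloglog <- 1 - exp(-exp(eta))    # cloglog: asymmetric, approaches 1 faster
round(data.frame(eta, inv_logit, inv_cloglog), 3)
```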
We now fit this second model with the new link function, to see whether it outperforms the previous one:
\[~\]
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 241
## Unobserved stochastic nodes: 6
## Total graph size: 2694
##
## Initializing model
## Inference for Bugs model at "C:/Users/Jeremy/AppData/Local/Temp/RtmpsXRjp9/model1da471586646.txt", fit using jags,
## 3 chains, each with 10000 iterations (first 1000 discarded), n.thin = 10
## n.sims = 2700 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
## beta13 -0.318 0.071 -0.454 -0.366 -0.320 -0.271 -0.174 1.001 2700
## beta4 -0.129 0.059 -0.253 -0.167 -0.127 -0.089 -0.022 1.001 2700
## beta5 134.976 248.710 -42.256 -0.024 0.096 174.474 773.329 1.523 8
## beta6 -1.367 7.801 -27.914 -0.065 0.000 0.012 10.151 1.082 300
## beta7 0.165 0.091 -0.016 0.108 0.164 0.227 0.334 1.002 1600
## beta8 0.598 0.105 0.398 0.527 0.597 0.664 0.809 1.000 2700
## deviance 278.457 3.165 274.329 276.141 277.780 280.054 286.110 1.001 2200
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 5.0 and DIC = 283.5
## DIC is an estimate of expected predictive error (lower deviance is better).
\[~\]
Can you see something different? Yes: look at the DIC! The previous model is better, since its DIC is clearly lower (204.0 against 283.5).
…mmh, but what is the DIC?
\[~\]
\[~\]
As we saw above, we fit a different model by changing the link function and inserting a polynomial term; the DIC (Deviance Information Criterion) is the indicator we use to pick the better model.
The deviance information criterion (DIC) is a hierarchical modeling generalization of the Akaike information criterion (AIC). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. DIC is an asymptotic approximation as the sample size becomes large, like AIC.
The DIC is calculated as:
\[ DIC = p_D + \overline{D(\theta)} \]
where \(\overline{D(\theta)}\) is the posterior mean of the deviance and \(p_D\) is the effective number of parameters.
The larger the effective number of parameters, the easier it is for the model to fit the data, and so the deviance needs to be penalized.
The lower the DIC value, the better the model: in this case the first model wins.
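With the JAGS output, \(p_D\) and DIC can be recomputed by hand from the monitored deviance samples. A sketch, reusing `chainMat` and the same rule R2jags reports (\(p_D = \mathrm{var}(deviance)/2\)):

```r
# pD and DIC from the monitored deviance samples in the chain.
dev <- chainMat[, "deviance"]
pD  <- var(dev) / 2       # effective number of parameters
DIC <- mean(dev) + pD     # posterior mean deviance plus penalty
c(pD = pD, DIC = DIC)
```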
I also decided to report other criteria similar to the DIC; we will see in detail what they represent.
\[~\]
\[~\]
The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.
In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.
The Akaike information criterion is defined as
\[ AIC(m) = D(\hat{\theta}_m, m) + 2 \cdot d_m \]
It is also a penalized deviance measure, with a penalty equal to 2 for each estimated parameter; \(d_m\) is the number of parameters considered in model \(m\).
In our case comparing the two models we have these values of AIC:
## The AIC for the FIRST model is: 211.261641833565
## The AIC for the SECOND model is: 290.456768891642
\[~\]
\[~\]
The Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. BIC introduces a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
The Bayesian information criterion is defined as
\[ BIC(m) = D(\hat{\theta}_m, m) + \log(n) \cdot d_m \]
where:
\(n\) is the number of observations considered,
\(d_m\) is the number of coefficients considered,
\(D(\hat{\theta}_m, m)\) is the deviance of the model.
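Both criteria are penalized versions of the same deviance. A minimal sketch, where `dev_hat` stands for \(D(\hat{\theta}_m, m)\) and the example values (8 coefficients, 241 training observations, an illustrative deviance) are assumptions, not outputs of the fit:

```r
# Penalized-deviance view of both criteria.
aic <- function(dev_hat, d_m)    dev_hat + 2 * d_m       # penalty: 2 per parameter
bic <- function(dev_hat, d_m, n) dev_hat + log(n) * d_m  # penalty grows with n
# illustrative call: 8 coefficients, n = 241 training observations
c(AIC = aic(195.26, 8), BIC = bic(195.26, 8, 241))
```

Since \(\log(n) > 2\) for any \(n > 7\), BIC penalizes extra parameters more heavily than AIC, which is why the BIC values above sit higher.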
In our case, comparing the two models we have these values of BIC:
## The BIC for the FIRST model is: 239.14001730149
## The BIC for the SECOND model is: 311.365550492586
\[~\]
\[~\]
Here we consider all the criteria together:
## DIC AIC BIC
## 1 204.040 211.2616 239.1400
## 2 283.466 290.4568 311.3656
As we can see, all three criteria agree: the first model scores lower on each of them. These values are only used as indicators of how well each model fits our data.
\[~\]
\[~\] We now plot the logistic regression using the parameters obtained with the first model, comparing the fitted curve for each parameter-response pair.
For readability, we show only the significant ones.
\[~\]
\[~\] Now that we have established that the first model is better than the second, we check that the proposed model can correctly recover its own parameters. I simulated data from the proposed model, using the beta estimates from the first model as the true values, and refit the model on the simulated data:
## module glm loaded
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 302
## Unobserved stochastic nodes: 8
## Total graph size: 3117
##
## Initializing model
## Inference for Bugs model at "C:/Users/Jeremy/AppData/Local/Temp/RtmpUvKNH7/model588c6e09158b.txt", fit using jags,
## 3 chains, each with 10000 iterations (first 1000 discarded), n.thin = 10
## n.sims = 2700 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
## beta0 1.428 0.338 0.781 1.196 1.428 1.653 2.089 1.001 2700
## beta10 -0.853 0.185 -1.219 -0.974 -0.852 -0.729 -0.507 1.001 2700
## beta11 0.464 0.156 0.152 0.360 0.461 0.571 0.759 1.001 2700
## beta12 -0.828 0.174 -1.182 -0.940 -0.828 -0.710 -0.502 1.001 2700
## beta13 -0.882 0.171 -1.220 -0.996 -0.874 -0.762 -0.563 1.001 2000
## beta2 -1.867 0.394 -2.616 -2.143 -1.865 -1.600 -1.095 1.001 2700
## beta3 1.248 0.180 0.912 1.126 1.244 1.362 1.609 1.000 2700
## beta4 -0.189 0.146 -0.482 -0.287 -0.186 -0.093 0.097 1.001 2700
## deviance 283.847 3.991 278.102 280.979 283.168 285.995 293.393 1.001 2700
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 8.0 and DIC = 291.8
## DIC is an estimate of expected predictive error (lower deviance is better).
\[~\]
The values are roughly the same, so the parameters are recovered reasonably well, as we can see below:
## beta0 beta10 beta11 beta12 beta13 beta2 beta3 beta4
## Estimated 1.253 -0.69 0.645 -0.851 -0.531 -1.621 1.173 -0.381
## True 1.428 -0.853 0.464 -0.828 -0.882 -1.867 1.248 -0.189
## 2.5% 0.499 -1.197 0.193 -1.291 -0.915 -2.531 0.781 -0.765
## 97.5% 2.09 -0.174 1.091 -0.431 -0.155 -0.749 1.607 -0.016
## Recovered YES YES YES YES YES YES YES YES
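The recovery check in the table can be scripted: a parameter counts as recovered when the value estimated from the simulated data falls inside the original model's 95% interval. A sketch using the bounds printed above:

```r
# "Recovered" = the estimate from simulated data lies in the original 95% CI.
true_vals <- c(beta0 = 1.428, beta10 = -0.853, beta11 = 0.464, beta12 = -0.828,
               beta13 = -0.882, beta2 = -1.867, beta3 = 1.248, beta4 = -0.189)
lower <- c(0.499, -1.197, 0.193, -1.291, -0.915, -2.531, 0.781, -0.765)
upper <- c(2.090, -0.174, 1.091, -0.431, -0.155, -0.749, 1.607, -0.016)
ifelse(true_vals >= lower & true_vals <= upper, "YES", "NO")
```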
\[~\]
\[~\]
Here, I point out several improvements that could be made in the future:
\[~\]
\[~\]
We saw the strengths and the weaknesses of Bayesian analysis.
As a first step we built the model in JAGS, "cheating" a little on which independent variables might serve our purposes; after the usual considerations, we compared it with a second, clearly weaker model. Of course, many small things could still be improved in this analysis.
So, thanks for reading!
\[~\]
\[~\]
Heart Attack Analysis & Prediction Dataset - [Kaggle] https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset
For graphical plots - [Kaggle] https://www.kaggle.com/chaitanya99/heart-attack-analysis-prediction/data
Link functions, the model types - [Alteryx] https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Selecting-a-Logistic-Regression-Model-Type-Logit-Probit-or/ta-p/111269
HighCharter, for the amazing plots in R - [High Charter] https://jkunst.com/highcharter/articles/hchart.html
Coronary artery disease - [PubMed] https://pubmed.ncbi.nlm.nih.gov/3739881
Nitroglycerin - [Wikipedia] https://en.wikipedia.org/wiki/Nitroglycerin
Vessel Disease - [Digirad] https://www.digirad.com/triple-vessel-disease-diagnosis-treatment/
Different types of Angina - [CardioSmart] https://www.cardiosmart.org/topics/angina
Logistic Regression Bayesian Model R - [BayesBall] https://bayesball.github.io/BOOK/bayesian-multiple-regression-and-logistic-models.html
Typical Angina - [NCBI] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680106
RDocumentation - [RDocumentation] https://www.rdocumentation.org
Spectrum of ECG changes during exercise testing - [SomePoMed] https://somepomed.org/articulos/contents/mobipreview.htm?13/38/13930
\[~\]
\[~\]